Improve search: multi-term AND + relevance ranking (FTS spike) #95

Open
rdhyee wants to merge 2 commits into isamplesorg:main from rdhyee:feature/fts-spike

rdhyee commented Apr 9, 2026

Summary

Closes #84 — FTS spike complete with immediate search improvements and documented future path.

Shipped now (zero new dependencies):

  • Multi-term search: "pottery Cyprus" requires BOTH words to match (was OR on the full phrase)
  • Relevance ranking: results sorted by score when searching — label match = 3pts, place = 2pts, description = 1pt
  • When not searching, results remain random for exploration variety
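
The matching and scoring rules above can be modelled in a few lines of plain Python. This is a minimal sketch rather than the Explorer's actual query code: it assumes the three searchable fields are named `label`, `place_name` and `description`, uses case-insensitive substring matching as a stand-in for ILIKE, and the sample records are invented for illustration.

```python
import random

# Field weights as stated in the summary: label 3, place 2, description 1.
WEIGHTS = {"label": 3, "place_name": 2, "description": 1}

# Hypothetical records, for illustration only.
RECORDS = [
    {"label": "Pottery sherd", "place_name": "Cyprus",
     "description": "Bronze Age pottery fragment"},
    {"label": "Basalt core", "place_name": "Cyprus", "description": ""},
    {"label": "Bowl", "place_name": "Athens",
     "description": "pottery from Cyprus trade"},
]


def term_score(record, term):
    """Sum the weights of every field that contains the term (case-insensitive)."""
    needle = term.lower()
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if needle in (record.get(field) or "").lower()
    )


def search(records, query):
    """Return records matching ALL terms, highest total score first."""
    terms = query.split()
    if not terms:
        # Browsing mode: no ranking, just a random order.
        return random.sample(list(records), k=len(records))
    scored = []
    for record in records:
        per_term = [term_score(record, term) for term in terms]
        if all(score > 0 for score in per_term):  # AND across terms
            scored.append((sum(per_term), record))
    scored.sort(key=lambda pair: -pair[0])
    return [record for _, record in scored]
```

With these records, `search(RECORDS, "pottery Cyprus")` drops the basalt sample because "pottery" matches none of its fields, and ranks the sherd (label and description hit for "pottery", place hit for "Cyprus") above the bowl (description hits only).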

FTS spike findings:

  • Built offline DuckDB FTS index with tools/build_fts_index.py
  • Full index (label + description + place_name): 358 MB — too large for auto-download
  • Lite index (label + place_name only): 211 MB — still substantial
  • BM25 scoring works well (Porter stemming, English stopwords)
  • ATTACH over HTTP in DuckDB-WASM is supported but downloading 200-358 MB is impractical
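
For readers unfamiliar with the DuckDB FTS extension, the sketch below generates the kind of SQL an offline index build and a BM25 query involve. It is not the contents of tools/build_fts_index.py: the table name `samples`, the identifier column `pid`, and the helper names are assumptions made for illustration. `PRAGMA create_fts_index` and `fts_main_<table>.match_bm25` are the extension's documented entry points.

```python
def fts_statements(parquet_path, id_col="pid",
                   text_cols=("label", "description", "place_name")):
    """SQL statements that load a parquet file and index its text columns.

    The table name `samples` and the id column `pid` are hypothetical.
    """
    select_cols = ", ".join((id_col, *text_cols))
    quoted_cols = ", ".join(f"'{col}'" for col in text_cols)
    safe_path = parquet_path.replace("'", "''")
    return [
        "INSTALL fts",
        "LOAD fts",
        f"CREATE TABLE samples AS SELECT {select_cols} "
        f"FROM read_parquet('{safe_path}')",
        # Porter stemming and English stopwords, as reported in the findings.
        f"PRAGMA create_fts_index('samples', '{id_col}', {quoted_cols}, "
        "stemmer = 'porter', stopwords = 'english')",
    ]


def bm25_query(search_text, id_col="pid", limit=20):
    """SQL that ranks rows of `samples` by BM25 score for the search text."""
    safe_text = search_text.replace("'", "''")
    return (
        f"SELECT {id_col}, label, score FROM ("
        f"SELECT *, fts_main_samples.match_bm25({id_col}, '{safe_text}') AS score "
        "FROM samples) sq "
        f"WHERE score IS NOT NULL ORDER BY score DESC LIMIT {int(limit)}"
    )
```

With the `duckdb` Python package installed, each statement could be passed to `con.execute(...)` on a connection from `duckdb.connect("fts.duckdb")`. Passing `text_cols=("label", "place_name")` corresponds to the lite variant described above.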

Recommended next steps (not in this PR):

  1. Explore pre-tokenized search parquet (inverted index as parquet, much smaller)
  2. Consider on-demand FTS loading behind an "Enhanced Search" toggle
  3. Evaluate DuckDB text analytics functions (stemming without full index)
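
To make the first option concrete, here is one possible shape for a pre-tokenized index: a flat list of `(token, pid)` rows that could be written out as a two-column parquet file and queried with plain equality filters. Everything here is an assumption for illustration (the `pid` identifier, the tokenizer, the field list), and the sketch does no stemming, so only exact tokens match.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall((text or "").lower())


def build_postings(records, fields=("label", "place_name")):
    """Return sorted, de-duplicated (token, pid) rows for the given fields."""
    rows = set()
    for record in records:
        for field in fields:
            for token in tokenize(record.get(field)):
                rows.add((token, record["pid"]))
    return sorted(rows)


def lookup_all(rows, query):
    """Return the pids whose postings contain every query token (AND)."""
    by_token = defaultdict(set)
    for token, pid in rows:
        by_token[token].add(pid)
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(by_token.get(terms[0], set()))
    for term in terms[1:]:
        result &= by_token.get(term, set())
    return result


# Hypothetical records, for illustration only.
SAMPLES = [
    {"pid": "s1", "label": "Pottery sherd", "place_name": "Cyprus"},
    {"pid": "s2", "label": "Basalt core", "place_name": "Cyprus"},
]
ROWS = build_postings(SAMPLES)
```

Here `lookup_all(ROWS, "pottery cyprus")` returns `{"s1"}`, while `lookup_all(ROWS, "cyprus")` returns both identifiers. Whether such a table ends up much smaller than the 211 MB lite index would need to be measured.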

Test plan

  • Search "pottery" → results ranked by relevance (label matches first)
  • Search "pottery Cyprus" → only samples matching BOTH words
  • Search "basalt" → geological samples with label matches at top
  • Clear search → results return to random sampling
  • Verify tools/build_fts_index.py runs successfully with local parquet

🤖 Generated with Claude Code

Search improvements (immediate):
- Multi-term search: "pottery Cyprus" requires BOTH words to match
- Relevance ranking: label matches weighted 3x, place 2x, description 1x
- Results sorted by relevance score when searching (random for browsing)

FTS spike (future path, documented):
- Added tools/build_fts_index.py to build DuckDB FTS index offline
- Tested: 358 MB full index, 211 MB lite — too large for auto-download
- BM25 scoring works correctly (Porter stemming, stopwords)
- Next step: explore smaller index strategies or on-demand loading

Closes isamplesorg#84 (spike complete — findings documented in PR)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Search input was passed into ILIKE patterns with only single-quote
escaping, so a literal "%" or "_" in the query (e.g. "100%", "co_op")
silently turned into wildcards. Escape % _ \ and add ESCAPE '\' in
both whereClause and the relevance-score expression.

Also reframe tools/build_fts_index.py as a spike artifact: the
docstring told readers to upload the index to data.isamples.org, but
per PR isamplesorg#95 findings the 200-358 MB result is too large to ship. Mark
the script NOT in production pipeline and drop the misleading upload
instructions.

Smoke-tested locally with /tmp/explorer_smoke_test.py (multi-term
"pottery cyprus" + wildcard "100%"): 0 JS exceptions, 0 console
errors, 0 failed requests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rdhyee commented Apr 28, 2026

Reviewed and pushed two small follow-ups (134aca2):

1. ILIKE wildcard escaping. Search input was passed into the ILIKE pattern with only single-quote escaping, so literal % or _ in the query (e.g. 100%, co_op) silently became wildcards. Now escape % _ \ and add ESCAPE '\' in both the whereClause block and the relevance-score expression.

2. FTS spike script header. tools/build_fts_index.py told readers to "upload to data.isamples.org" but per the PR's own findings the 200-358 MB result is too large to ship. Reframed as STATUS: spike artifact — NOT in production pipeline, kept the script for future revisits, dropped the misleading upload instructions.
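
The wildcard problem in item 1 is easy to reproduce outside DuckDB. The sketch below is an illustration, not the PR's code: it uses Python's built-in `sqlite3`, whose `LIKE ... ESCAPE` behaves like ILIKE for ASCII text, and it binds the pattern as a parameter instead of interpolating it with single-quote escaping as the Explorer does. The table contents and helper names are made up.

```python
import sqlite3


def escape_like(term):
    """Escape LIKE metacharacters so they match literally under ESCAPE '\\'."""
    # Escape the backslash first, otherwise the escapes added below get doubled.
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def count_matches(con, term, escaped=True):
    """Count labels containing the term, with or without wildcard escaping."""
    needle = escape_like(term) if escaped else term
    row = con.execute(
        "SELECT COUNT(*) FROM samples WHERE label LIKE ? ESCAPE '\\'",
        ("%" + needle + "%",),
    ).fetchone()
    return row[0]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (label TEXT)")
con.executemany(
    "INSERT INTO samples VALUES (?)",
    [("100% cotton",), ("1000 sherds",), ("co_op store",), ("coXop",)],
)
```

Without escaping, `100%` also matches "1000 sherds" and `co_op` also matches "coXop"; with escaping, each query matches only the label that contains it literally.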

Smoke test (/tmp/explorer_smoke_test.py against local Quarto render):

Serving docs on :64856
URL: http://127.0.0.1:64856/tutorials/isamples_explorer.html
JS exceptions:    0
Console errors:   0
Failed requests:  0
RESULT: PASS

Exercised: initial load, multi-term search (pottery cyprus), wildcard-char search (100%). Screenshot confirms the new placeholder and that 100% no longer matches everything.

Other notes from review (not blocking):

  • Score expression has discrete plateaus (integer values 0 through 6 per term, from summing the 3/2/1 field weights); ties break alphabetically on label. Fine for a spike; could mention in placeholder docs later.
  • description ILIKE over the wide parquet over HTTP range-fetch may add first-search latency; worth a ?perf=1 measurement before declaring search "done", but out of scope here.
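
As a quick check on the plateau note, the per-term scores reachable under the stated 3/2/1 weights can be enumerated directly, and the tie-break can be expressed as a compound sort key. This is a small sketch under those stated weights; the row dictionaries and function names are hypothetical.

```python
from itertools import combinations

# Weights from the PR summary: label 3, place 2, description 1.
WEIGHTS = {"label": 3, "place": 2, "description": 1}


def reachable_scores(weights):
    """Every per-term score obtainable by matching some subset of the fields."""
    values = list(weights.values())
    return sorted(
        {sum(combo)
         for size in range(len(values) + 1)
         for combo in combinations(values, size)}
    )


def rank(rows):
    """Sort by score descending, then alphabetically by label to break ties."""
    return sorted(rows, key=lambda row: (-row["score"], row["label"]))
```

Enumerating the subsets gives every integer from 0 to 6, so many rows will share a score and the alphabetical tie-break largely determines the order within each plateau.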

LGTM to merge once you've eyeballed the diff.

Development

Successfully merging this pull request may close these issues.

Explore DuckDB FTS extension for full-text search in Explorer